Read the CSV data into a nested dictionary with the structure data->label->kind, where label is the type of cell medium (described in the README) and kind is the chemical modification of the nanoparticles used to collect the spectra (one of NH2, COOH, (COOH)2). The data will be normalized to the [0, 1] range, so define the normalizing function in advance.
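A minimal sketch of such a normalizing function, assuming plain min-max scaling applied to each spectrum separately. The name `normalize` and the handling of flat spectra are illustrative choices, not taken from the original notebook.

```python
import numpy as np

def normalize(spectrum):
    """Scale a 1-D spectrum linearly so its minimum maps to 0 and its maximum to 1."""
    spectrum = np.asarray(spectrum, dtype=float)
    lo, hi = spectrum.min(), spectrum.max()
    if hi == lo:
        # Assumption: a flat spectrum has no range to scale, so return zeros
        # instead of dividing by zero.
        return np.zeros_like(spectrum)
    return (spectrum - lo) / (hi - lo)

example = normalize([2.0, 4.0, 6.0])
print(example)
```

Scaling each spectrum on its own removes differences in absolute intensity between measurements, which is usually what is wanted when only the spectral shape is compared.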

Meet the data

We will compare only fibroblasts, healthy versus cancer, since both were derived from patients.

Check that the data were read correctly by inspecting the keys and plotting a couple of spectra.

EDA

Statistical testing

First, check whether the data are normally distributed
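A per-feature normality check could look like the sketch below. It assumes the Shapiro-Wilk test from SciPy, since the text does not say which test was used, and it runs on synthetic placeholder data. The helper `normality_pvalues` is hypothetical.

```python
import numpy as np
from scipy import stats

def normality_pvalues(X):
    """Return the Shapiro-Wilk p-value for every column (feature) of X."""
    X = np.asarray(X, dtype=float)
    # stats.shapiro returns (statistic, p-value); index 1 is the p-value.
    return np.array([stats.shapiro(X[:, j])[1] for j in range(X.shape[1])])

# Synthetic placeholder data, not the spectra from this notebook.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(size=200),       # drawn from a normal distribution
    rng.exponential(size=200),  # strongly skewed, clearly non-normal
])

pvals = normality_pvalues(X_demo)
# The null hypothesis is "the data are normal", so a small p-value
# (for example below 0.05) means normality is rejected.
fraction_not_rejected = np.mean(pvals > 0.05)
print(pvals, fraction_not_rejected)
```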

Sub-conclusion: the p-values are contradictory. It does not really make sense to compare all features.

Savitzky-Golay filter
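As an illustration of the smoothing step, here is a small sketch using `scipy.signal.savgol_filter` on a synthetic noisy signal. The window length and polynomial order are assumed values chosen for the demo, not the settings used in the original notebook.

```python
import numpy as np
from scipy.signal import savgol_filter

# Synthetic placeholder signal: a sine wave with added Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 4 * np.pi, 500)
clean = np.sin(x)
noisy = clean + rng.normal(scale=0.2, size=x.size)

# Fit a cubic polynomial in a sliding window of 21 points.
# window_length must be larger than polyorder; an odd value is the safe choice.
smoothed = savgol_filter(noisy, window_length=21, polyorder=3)

mse_noisy = np.mean((noisy - clean) ** 2)
mse_smoothed = np.mean((smoothed - clean) ** 2)
print(mse_noisy, mse_smoothed)
```

A wider window smooths more aggressively but can flatten narrow peaks, so for spectra the window is typically kept smaller than the narrowest peak of interest.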

Normality test for savgol-filtered data

Sub-conclusion: after savgol filtering, the p-values are still contradictory. It does not really make sense to compare all features, so let us proceed with only the features of interest.

Working with peaks (features) of interest

Still, only about one third of the peaks are consistent with a normal distribution (normality not rejected, p > 0.05); the others are not.

Splitting into train_val and test
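A hedged sketch of such a split, assuming scikit-learn's `train_test_split` with stratification on the class label. The array shapes, the 80/20 ratio and the random seed are placeholders rather than the notebook's actual values.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 100 samples, 20 features, balanced binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = np.repeat([0, 1], 50)  # 0 = healthy, 1 = cancer (assumed encoding)

# stratify=y keeps the class proportions equal in both parts.
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train_val.shape, X_test.shape)
```

The test part is set aside here and touched only once, after all model choices have been made on the train_val part.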

for savgol-filtered data

Correlation matrix

Features within one spectrum are highly correlated with each other (remember that each sample has 3 spectra).
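To make the idea concrete, the sketch below computes a feature-by-feature Pearson correlation matrix with NumPy. The data are synthetic: two nearly identical columns stand in for neighbouring points of one spectrum, and a third column is independent.

```python
import numpy as np

# Synthetic placeholder data, not the measured spectra.
rng = np.random.default_rng(0)
base = rng.normal(size=100)
X = np.column_stack([
    base,
    base + 0.1 * rng.normal(size=100),  # nearly a copy of the first column
    rng.normal(size=100),               # unrelated column
])

# rowvar=False treats columns as variables and rows as observations.
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))
```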

Dimensionality reduction: PCA

Data preparation

PCA implementation
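One possible implementation, sketched with scikit-learn on synthetic placeholder data. It assumes the features are standardized first and that the scaler and PCA are fitted on the training part only; the number of components (3) is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data standing in for the train and test spectra.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 20))
X_test = rng.normal(size=(20, 20))

# Fit the scaler and the PCA on the training data only,
# then apply the same transformation to the test data.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=3).fit(scaler.transform(X_train))

scores_train = pca.transform(scaler.transform(X_train))
scores_test = pca.transform(scaler.transform(X_test))
print(pca.explained_variance_ratio_)
```

Fitting on the training part only keeps information from the test samples out of the projection.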

We can see that PC3 is highly correlated with the class label.
Meanwhile, the PCs are not correlated with each other, which is expected because principal components are orthogonal, and therefore uncorrelated, by construction.
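Both observations can be checked numerically. The sketch below uses a plain SVD-based PCA in NumPy on synthetic data in which one feature is shifted by class; all names and numbers are illustrative.

```python
import numpy as np

# Synthetic placeholder data: feature 0 depends on the class label.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 10))
X[:, 0] += 3 * y

# PCA via SVD of the centered data matrix: scores = U * S.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * S

# Correlation among the first three PC score vectors (off-diagonals are ~0).
pc_corr = np.corrcoef(scores[:, :3], rowvar=False)

# Pearson correlation of each PC with the binary class label.
label_corr = np.array([np.corrcoef(scores[:, k], y)[0, 1] for k in range(3)])
print(np.round(pc_corr, 6))
print(np.round(label_corr, 2))
```

The sign of a PC is arbitrary, so only the absolute value of its correlation with the label is meaningful.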

PCA Loadings
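Definitions of "loadings" vary between sources. The sketch below assumes one common convention, the component vectors scaled by the square root of the explained variance, and uses synthetic data in which two features are strongly correlated.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data: features 0 and 1 are strongly correlated.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
X[:, 1] = X[:, 0] + 0.3 * rng.normal(size=100)

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_std)

# Assumed convention: loadings = eigenvectors * sqrt(eigenvalues).
# Rows are features, columns are principal components.
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

# Features contributing most to PC1, ranked by absolute loading.
top_features = np.argsort(np.abs(loadings[:, 0]))[::-1][:2]
print(np.round(loadings, 2), top_features)
```

For spectra, plotting a loading column against the wavenumber axis shows which spectral regions drive each component.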

Machine Learning Modeling

According to the PCA plots, Logistic Regression should separate the classes well.

a) Use all 3 spectra types (all 3 types of AuNPs)

Logistic Regression
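A minimal sketch of the modeling step, assuming a scikit-learn pipeline of standardization followed by logistic regression. The data are synthetic, and the class names "healthy" and "cancer" are used only to label the report.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data: the first three features are shifted by class.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 20))
X[:, :3] += 2.0 * y[:, None]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The scaler is part of the pipeline, so it is fitted on the training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
report = classification_report(y_test, y_pred, target_names=["healthy", "cancer"])
print(report)
```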

Classification report

ROC AUC
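The ROC AUC is computed from predicted probabilities rather than hard class labels. The following sketch shows this on synthetic data, again assuming a standardization plus logistic regression pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data: the first three features are shifted by class.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 20))
X[:, :3] += 2.0 * y[:, None]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Column 1 holds the predicted probability of the positive class (label 1).
proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
fpr, tpr, thresholds = roc_curve(y_test, proba)
print(auc)
```

With a test set of only 20 samples the AUC estimate is coarse, so a single split can easily look better or worse than the model really is.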

The model performance is suspiciously perfect, so let us use cross-validation.

Cross Validation
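A sketch of stratified k-fold cross-validation with scikit-learn. Wrapping the scaler and the classifier in one pipeline means preprocessing is refitted inside every fold, which avoids leaking validation data into the scaling step. The fold count and the scoring metric are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data: the first three features are shifted by class.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 20))
X[:, :3] += 2.0 * y[:, None]

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Five stratified folds; each fold keeps the class balance of the full set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(scores, scores.mean(), scores.std())
```

Reporting the spread across folds along with the mean gives a sense of how stable the score is on a small dataset.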

Feature importances
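One simple way to rank features for logistic regression is by the absolute value of the fitted coefficients, provided the features were standardized so the coefficients are on a comparable scale. The sketch below assumes that approach on synthetic data; it is not necessarily the method used in the original notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data: only the first three features carry class signal.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 20))
X[:, :3] += 2.0 * y[:, None]

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# make_pipeline names each step after its lowercased class name.
coefs = model.named_steps["logisticregression"].coef_.ravel()

# Rank features from largest to smallest absolute coefficient.
order = np.argsort(np.abs(coefs))[::-1]
print(order[:5], np.round(coefs[order[:5]], 2))
```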

Way 1: built-in function
Even with cross-validation the results are too perfect, so let us use only one type of AuNPs.

b) Use only COOH spectra

Split into Train, Validation, (Train+Validation) and Test sets

PCA

(Repeat the same steps with another set of data, one third the size of the earlier one.)

PCA implementation

Now PC2 is correlated with the class label.
Meanwhile, the PCs are not correlated with each other, which is expected because principal components are orthogonal, and therefore uncorrelated, by construction.

Logistic Regression

ROC AUC

Cross Validation

Feature importances

https://sefiks.com/2021/01/06/feature-importance-in-logistic-regression/

Conclusion:

Even with one type of AuNPs and cross-validation, the results are too perfect, although there was no data leakage.

For further research, it would be interesting to test more data, since our overall dataset is currently limited to 100 samples.